Goals of this Notebook

About the data

I collected cheese price data from the United States Department of Agriculture (USDA). It ranges from 2000 to 2022, but structure of it changed after 2018. From 2000 to 2018, data was aggregated in one sheet. After 2018, the USDA created an API that recorded each year’s data individually for each kind of cheese. I’ll need to merge these files together.

One thing to note is that I’ll be focusing on the import (not domestic) prices for this cheese to see how it trends over time. I also want to note that the price is in dollars per pound. You can also check out these FAQs about USDA market data.

Setup

I’m loading the tidyverse library and lubridate package so I can clean up my data.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Import data

I’ll import my first file here, which ranges from January 2000 to February 2018. I’ll filter it so I’m looking at just the domestic cheese. I’ll also remove repeated rows.

Then, I’ll fix my dates so R reads them as dates.

thru_2018_clean <- read_csv("data-raw/thru_2018.csv") |> 
  distinct() |> #I'm using this to remove duplicate rows
  clean_names() |> 
  filter(
    region != "FOREIGN" #I don't want the foreign cheese
    ) |> 
  mutate(
    date = mdy(date)
  ) |> 
  select(
    date, type, high_price, low_price, region)
## Rows: 48237 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date, Region, Commodity, Type
## dbl (8): Report Number, Report Year, Low Price, High Price, Imported Low Pri...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
thru_2018_clean |> glimpse()
## Rows: 16,671
## Columns: 5
## $ date       <date> 2000-01-07, 2000-01-07, 2000-01-07, 2000-01-07, 2000-01-07…
## $ type       <chr> "CHEDDAR 10# PRINTS", "CHEDDAR 40# BLOCK", "CHEDDAR SINGLE …
## $ high_price <dbl> 1.6725, 1.4850, 1.6225, 2.5500, 1.6675, 1.5175, 1.6225, 2.1…
## $ low_price  <dbl> 1.1825, 1.3075, 1.1625, 2.3500, 1.3650, 1.3650, 1.3850, 1.8…
## $ region     <chr> "NORTHEAST", "NORTHEAST", "NORTHEAST", "NORTHEAST", "NORTHE…

Now I’ll import all the sheets that came after 2018. I put the midwest, northeast and west files for each year following 2018 in a folder called file-list and import them all here.

list_of_files <- list.files(path = "file-list", recursive = TRUE,
                            pattern = "\\.csv$", 
                            full.names = TRUE) #imports the file list

cheese_dirty <- list_of_files |> 
  purrr::set_names(nm = (basename("file-list") |>  tools::file_path_sans_ext())
) |> 
  purrr::map_df(read_csv, 
                col_names = TRUE) #sets names
## Rows: 397 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (3): lot Desc, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 408 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (3): lot Desc, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 416 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (3): lot Desc, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 416 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (3): lot Desc, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 352 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (3): lot Desc, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 217 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (3): lot Desc, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 208 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (4): lot Desc, origin, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 208 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (4): lot Desc, origin, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 208 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (4): lot Desc, origin, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 176 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (4): lot Desc, origin, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 275 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (3): lot Desc, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 260 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (4): lot Desc, origin, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 260 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (4): lot Desc, origin, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 260 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (4): lot Desc, origin, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 220 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): group, category, report begin date, report end date, published dat...
## dbl  (3): slug id, price min, price max
## lgl  (4): lot Desc, origin, mostly low price, mostly high price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cheese_dirty

Now I’ll get rid of unnecessary columns and fix my dates. I’ll also filter the date range so it will pick up where the other sheet left off.

cheese_clean <- cheese_dirty |> 
  clean_names() |> 
    mutate(
    date = mdy(report_begin_date),
    low_price = price_min,
    high_price = price_max,
    type = group
) |> 
  filter(date > 02-05-2018,
         application == "Domestic" #to remove foreign cheese
) |> 
  select(date, region, low_price, high_price, type)

cheese_clean |> glimpse()
## Rows: 2,696
## Columns: 5
## $ date       <date> 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31…
## $ region     <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", "Wiscon…
## $ low_price  <dbl> 1.7350, 1.5350, 1.7100, 1.4625, 1.7350, 2.7525, 2.0025, 1.3…
## $ high_price <dbl> 2.1600, 2.4750, 1.9150, 1.8575, 2.1600, 2.8700, 3.0725, 1.7…
## $ type       <chr> "Muenster", "Mozzarella", "Monterey Jack", "Cheddar", "Bric…

Now I’ll bind my 2000 to 2018 data with my data from 2018 on.

all_cheese <- cheese_clean |> 
  bind_rows(thru_2018_clean)

all_cheese |> summary() #check my date range
##       date               region            low_price       high_price   
##  Min.   :2000-01-07   Length:19367       Min.   :1.005   Min.   :1.272  
##  1st Qu.:2004-11-19   Class :character   1st Qu.:1.630   1st Qu.:1.970  
##  Median :2009-10-09   Mode  :character   Median :1.933   Median :2.288  
##  Mean   :2010-07-02                      Mean   :1.999   Mean   :2.377  
##  3rd Qu.:2016-01-22                      3rd Qu.:2.275   3rd Qu.:2.685  
##  Max.   :2022-10-31                      Max.   :4.192   Max.   :4.535  
##      type          
##  Length:19367      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Adjust for Inflation

Now I want to adjust my prices for inflation. I’ll be using the average yearly Consumer Price Index for Urban Consumers, or CPI-U from the Bureau of Labor Statistics to calculate each year’s cheese prices adjusted to 2015 dollars. First, I need to make a year column.

cheese_yrs <- all_cheese |>
  mutate(
    yr = year(date))

cheese_yrs

Now I’ll divide the average CPI for each year by the 2015 CPI. Then, I’ll multiply it by each price in order to get a column where every price in my datast is adjusted to 2015 dollars.

A CPI estimate for 2022 is based on the change in the CPI from first quarter 2021 to first quarter 2022.

adj_cheese <- cheese_yrs |>
  mutate(
    adj_high_price = case_when(
      yr == 2000 ~ high_price*237.017/172.2,
      yr == 2001 ~ high_price*237.017/177.1,
      yr == 2002 ~ high_price*237.017/179.9,
      yr == 2003 ~ high_price*237.017/184.0,
      yr == 2004 ~ high_price*237.017/188.9,
      yr == 2005 ~ high_price*237.017/195.3,
      yr == 2006 ~ high_price*237.017/201.6,
      yr == 2007 ~ high_price*237.017/207.3,
      yr == 2008 ~ high_price*237.017/215.3,
      yr == 2009 ~ high_price*237.017/214.3,
      yr == 2010 ~ high_price*237.017/218.1,
      yr == 2011 ~ high_price*237.017/224.9,
      yr == 2012 ~ high_price*237.017/229.6,
      yr == 2013 ~ high_price*237.017/233.0,
      yr == 2014 ~ high_price*237.017/236.7,
      yr == 2015 ~ high_price*237.017/237.0,
      yr == 2016 ~ high_price*237.017/240.0,
      yr == 2017 ~ high_price*237.017/245.1,
      yr == 2018 ~ high_price*237.017/251.1,
      yr == 2019 ~ high_price*237.017/255.7,
      yr == 2020 ~ high_price*237.017/258.8,
      yr == 2021 ~ high_price*237.017/271.0,
      yr == 2022 ~ high_price*237.017/294.4
      ),
  adj_low_price = case_when(
      yr == 2000 ~ low_price*237.017/172.2,
      yr == 2001 ~ low_price*237.017/177.1,
      yr == 2002 ~ low_price*237.017/179.9,
      yr == 2003 ~ low_price*237.017/184.0,
      yr == 2004 ~ low_price*237.017/188.9,
      yr == 2005 ~ low_price*237.017/195.3,
      yr == 2006 ~ low_price*237.017/201.6,
      yr == 2007 ~ low_price*237.017/207.3,
      yr == 2008 ~ low_price*237.017/215.3,
      yr == 2009 ~ low_price*237.017/214.3,
      yr == 2010 ~ low_price*237.017/218.1,
      yr == 2011 ~ low_price*237.017/224.9,
      yr == 2012 ~ low_price*237.017/229.6,
      yr == 2013 ~ low_price*237.017/233.0,
      yr == 2014 ~ low_price*237.017/236.7,
      yr == 2015 ~ low_price*237.017/237.0,
      yr == 2016 ~ low_price*237.017/240.0,
      yr == 2017 ~ low_price*237.017/245.1,
      yr == 2018 ~ low_price*237.017/251.1,
      yr == 2019 ~ low_price*237.017/255.7,
      yr == 2020 ~ low_price*237.017/258.8,
      yr == 2021 ~ low_price*237.017/271.0,
      yr == 2022 ~ low_price*237.017/294.4
    ))

adj_cheese |> summary() #checking columns
##       date               region            low_price       high_price   
##  Min.   :2000-01-07   Length:19367       Min.   :1.005   Min.   :1.272  
##  1st Qu.:2004-11-19   Class :character   1st Qu.:1.630   1st Qu.:1.970  
##  Median :2009-10-09   Mode  :character   Median :1.933   Median :2.288  
##  Mean   :2010-07-02                      Mean   :1.999   Mean   :2.377  
##  3rd Qu.:2016-01-22                      3rd Qu.:2.275   3rd Qu.:2.685  
##  Max.   :2022-10-31                      Max.   :4.192   Max.   :4.535  
##      type                 yr       adj_high_price  adj_low_price   
##  Length:19367       Min.   :2000   Min.   :1.255   Min.   :0.9983  
##  Class :character   1st Qu.:2004   1st Qu.:2.146   1st Qu.:1.8035  
##  Mode  :character   Median :2009   Median :2.509   Median :2.0646  
##                     Mean   :2010   Mean   :2.591   Mean   :2.1800  
##                     3rd Qu.:2016   3rd Qu.:2.945   3rd Qu.:2.4907  
##                     Max.   :2022   Max.   :4.521   Max.   :4.1981

Export my data

adj_cheese |> write_rds("data-processed/01-cleaning.rds")